A method and an apparatus for classification of a mixed
speech and noise signal



The invention concerns a method and an apparatus for
classification of a mixed speech and noise signal as being
significantly or insignificantly affected by the speech
signal.

The time intervals in which the mixed signal is
insignificantly affected by the speech signal may be used
to form a running estimate of the noise signal by known
methods, and the noise can then be suppressed on the basis
of this estimate.

The invention may be used in electroacoustic systems for
transmission and signal processing of speech signals (e.g.
mobile telephones, speech recognition systems and hearing
aids), where the aim is to eliminate or reduce the
degradation of speech quality, speech recognition and
speech perception caused by background noise, using noise
suppression and/or speech enhancement methods.

Electroacoustic systems for transmission and signal
processing of speech signals exist in numerous types and
for many different purposes. The extensive development in
the field of digital electronics, in particular of digital
signal processors, has made it possible to employ a number
of methods, not previously practical, for removing or
suppressing in real time the background noise which occurs
either acoustically, simultaneously with the speech signal
(e.g. in a helicopter cockpit where machine and rotor noise
affects the acoustic communication from the pilot), or as
an equivalent electric signal in the transmission system
itself. Such methods are known from the literature and are
called noise suppression or speech enhancement methods.
Examples are adaptive filtering and spectral subtraction;
see e.g. (1) and (7). The aim of improving the signal/noise
ratio (the ratio of speech signal magnitude to noise
magnitude) is to counteract the degradation, caused by the
noise, of the reception and intelligibility of the
transmitted speech signal. Several of the known methods are
based on a running estimate of the statistical
characteristics of the background noise, e.g. intensity and
frequency content. A speech or pause detector identifies
time segments with and without speech signal, respectively,
and in the segments containing only background noise
(speech pauses) the characteristics of the noise may be
estimated by suitable signal analysis. Assuming a certain
stationarity of the background noise, this estimate may be
used for adjusting the noise suppression or speech
enhancement method until the next time the noise can be
estimated.

Several methods are described in the literature for
distinguishing between voiced speech, unvoiced speech and
pauses, both with and without background noise; see e.g.
(4), (5) and (8). Reference (9) includes, among other
things, a survey of the most important methods which have
been used for classification of speech, in particular in
connection with speech recognition systems.

Two of the known principles should be mentioned in
particular: the energy histogram and the valley detector
principles. A noise suppression method (3) reports use of
the valley detector method for identifying the time
intervals in which a mixed speech and noise signal consists
exclusively of background noise (i.e. corresponding to
pauses in the speech signal). In that invention the method
is incorporated in a type of feedback loop acting on the
individual frequency bands of the output signal, with the
purpose of widening the field of use of the speech/noise
detector.

WO 91/03042

PCT/DK90/00214

However, none of the known speech and pause detectors is
particularly robust when the speech signal is subjected to
e.g. considerable reverberation, or when the background
noise is added at a poor signal/noise ratio (less than
0 dB) or has a speech-like nature, i.e. resembles the
speech signal from one or more speakers. In these cases the
detection will be less certain with known methods. Attempts
have been made to reduce this problem by using a priori
knowledge about the speech and noise signals. Thus, (1) and
(2) utilize the fact that the amplitude fluctuations in
speech and noise are different in certain cases. When,
however, the noise is speech-like, this difference will be
marginal.

So far, no speech detector has been developed which can
operate reliably both with a poor signal/noise ratio and
with speech-like noise. The object of the present
invention is therefore to provide a method and an apparatus
which solve this problem.

This object is achieved by the method stated in claim 1
and the apparatus stated in claim 8, involving detection
of the time segments in a mixed speech and noise signal
which are dominated by the speech signal. This is to be
understood in combination with the well-known fact,
described below, that a speech signal includes a plurality
of time segments in which the speech signal contributes
only insignificantly to the mixed signal. Such segments are
not just speech pauses (between words and sentences,
breathing), but in particular also very short intervals,
typically within a word, in which the speech signal assumes
a value so small that it contributes only insignificantly
to the mixed signal. These segments are detected, and on
this basis it is possible to update parameters for the
background noise. This is done with unprecedented
frequency and can therefore form the basis for a
considerably more precise estimate of the background
noise.
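
The updating of background noise parameters can be
illustrated with a minimal Python sketch. It is not taken
from the specification, which leaves the estimation to
known methods: the function name update_noise_estimate, the
per-band power representation and the smoothing factor
rate are assumptions made for illustration only.

```python
def update_noise_estimate(noise_power, segment_power, speech_dominated, rate=0.1):
    """Return the updated per-band noise power estimate for one time segment.

    noise_power      -- current estimate, one value per frequency band
    segment_power    -- measured power of the mixed signal in this segment
    speech_dominated -- classification of the segment (True = speech dominated)
    rate             -- smoothing factor of the running average (assumed value)
    """
    if speech_dominated:
        # Speech dominates: keep the previous estimate unchanged.
        return list(noise_power)
    # Speech contributes insignificantly: move the estimate towards the measurement.
    return [(1.0 - rate) * n + rate * p for n, p in zip(noise_power, segment_power)]


estimate = [1.0, 2.0]
# A speech-dominated segment leaves the estimate untouched.
estimate = update_noise_estimate(estimate, [9.0, 9.0], speech_dominated=True)
# A noise-dominated segment updates it.
estimate = update_noise_estimate(estimate, [3.0, 4.0], speech_dominated=False)
```

The more often segments are classified as not dominated by
speech, the more often the second branch runs, which is the
sense in which frequent detection supports a more precise
noise estimate.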

In a speech signal the energy can assume relatively large
values in short time intervals, corresponding to some of
the voiced sounds (e.g. the open vowels) as well as some
of the consonants (the fricatives and the plosives).
Therefore, the signal/noise ratio will be relatively high
in time segments containing these speech sounds, and these
segments are thus particularly useful for detecting the
presence of speech in background noise. The reason why the
energy is large in the mentioned speech sounds is the
following:

1) A vowel may be described as a (quasi) periodic time
signal which in terms of frequency consists of a
fundamental frequency and its harmonics, whereby the speech
energy occurs simultaneously in a large frequency range.

2) A fricative and/or a plosive may be described as a
short, noise-like time signal where the energy occurs
simultaneously in a wide frequency range.

In the preferred embodiment of the invention the frequency
range of the speech signal is suitably divided into a
plurality of frequency bands, and it thus applies that for
each of the two types of speech sounds the energy occurs
with a certain simultaneity between the frequency bands.
Further, it is particular to the vowels that, since the
difference between two consecutive harmonic frequencies is
always equal to the fundamental frequency of the speech
signal, the envelope of a frequency-limited subsignal
containing two or more consecutive harmonic frequencies
will always be periodic and substantially synchronous with
the fundamental frequency, since the envelope represents a
beat signal with a frequency equal to the difference
between the two harmonics, which is precisely the
fundamental frequency. Since it is the same frequency, viz.
the fundamental frequency of the speech signal, which for
all the subsignals causes the beat signal detected by
envelope detection, the envelopes of the subsignals will be
substantially synchronous or correlated with each other.

In order that this envelope, which is periodic with the
fundamental frequency, can always be produced, it is
necessary that each subsignal has a frequency band width
which always comprises at least two harmonic frequencies.
This is obtained with a band width of at least twice the
fundamental frequency. If the fundamental frequency is
e.g. 220 Hz, the band width must be at least 440 Hz.
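
The beat relationship can be checked numerically. The
following Python sketch is illustrative only and not part of
the specification. It sums two consecutive harmonics of an
assumed 220 Hz fundamental (the 4th and 5th, an arbitrary
choice), takes the envelope as the magnitude of the
corresponding analytic signal, and compares the envelope
with itself one fundamental period and half a period later.
The names two_harmonic_envelope, F0 and K are hypothetical.

```python
import cmath
import math


def two_harmonic_envelope(f0, k, t):
    """Envelope at time t of the sum of harmonics k and k+1 of f0.

    The envelope is the magnitude of the analytic signal, i.e. of the
    sum of the two complex exponentials.
    """
    first = cmath.exp(2j * math.pi * k * f0 * t)
    second = cmath.exp(2j * math.pi * (k + 1) * f0 * t)
    return abs(first + second)


F0 = 220.0          # assumed fundamental frequency in Hz
K = 4               # arbitrary lower harmonic number inside the band
period = 1.0 / F0   # expected period of the beat envelope

times = [n * 1e-4 for n in range(1, 20)]

# Largest change of the envelope after one full fundamental period.
max_diff_full = max(
    abs(two_harmonic_envelope(F0, K, t) - two_harmonic_envelope(F0, K, t + period))
    for t in times
)
# Largest change after half a period, for comparison.
max_diff_half = max(
    abs(two_harmonic_envelope(F0, K, t) - two_harmonic_envelope(F0, K, t + period / 2))
    for t in times
)

# Band width needed to hold two consecutive harmonics at all times.
min_bandwidth = 2 * F0
```

The envelope repeats after one fundamental period but not
after half a period, whichever pair of consecutive harmonics
is chosen, which is why envelopes from different bands line
up with each other.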

It is well known from the literature, see e.g. (3), to
examine a mixed speech and noise signal by division into
time intervals and by splitting into a number of
subsignals by means of a filter bank consisting of bandpass
filters. However, in contrast to the previously described
methods, this is done in a particular manner in the present
invention, since the invention realizes a filter bank
consisting of bandpass filters with a band width which
depends specifically upon general characteristics of the
speech signal, as well as a detector utilizing the
correlation between the envelopes of the subsignals.
Moreover, and still in contrast to the previously described
methods, the aim of the present invention is not to
identify the time intervals in the mixed speech and noise
signal which consist only of noise (i.e. corresponding to
pauses in the speech signal), but to identify the intervals
which are dominated by the speech signal.

The invention will be explained more fully by the
following description of a preferred embodiment with
reference to the drawing, in which

fig. 1 is a block diagram schematically showing an
apparatus according to the invention,

fig. 2 shows an example of an input signal consisting of a
portion of a speech signal without noise, and how this
signal is processed in the apparatus in fig. 1,

fig. 2A shows the input signal,

fig. 2B shows the frequency limited subsignals originating
from filtering of the input signal,

fig. 2C shows the envelope signals corresponding to the
subsignals in fig. 2B,

fig. 2D shows the synchronism signal from the synchronism
detector as well as a threshold value with which it is
compared, and

fig. 2E shows the final classification signal from the
threshold detector.

In fig. 1 an electric input signal 101 consisting of a
speech signal mixed with a noise signal (traffic noise,
cafeteria noise, speech from other persons or the like) is
passed to a filter bank 102 consisting of a plurality of
optionally overlapping bandpass filters with increasing
center frequency, which in combination cover the entire
frequency range of the speech signal or part thereof. Each
bandpass filter has a band width greater than twice the
greatest expected value of the fundamental frequency of
the speech signal, so that a subsignal 103 comprising at
least two consecutive harmonic frequencies of the
fundamental frequency can pass through each bandpass
filter.
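
The band width rule can be turned into a simple layout of
band edges. The Python sketch below is only one possible
reading: it places contiguous, non-overlapping bands of
equal width, whereas the text also allows overlapping
filters. The function name band_edges, the 5 % safety margin,
the 300 to 3400 Hz range and the 400 Hz maximum fundamental
frequency are all assumed values, not figures from the
specification.

```python
def band_edges(f_low, f_high, max_f0, margin=0.05):
    """Split [f_low, f_high] into equal bands wider than 2 * max_f0.

    Returns a list of (lower_edge, upper_edge) tuples in Hz.
    """
    # Smallest width allowed, slightly above twice the highest fundamental.
    min_width = 2.0 * max_f0 * (1.0 + margin)
    span = f_high - f_low
    # Largest number of equal bands that still respects the minimum width.
    n_bands = int(span // min_width)
    if n_bands < 1:
        raise ValueError("frequency range too narrow for even one band")
    width = span / n_bands
    return [(f_low + i * width, f_low + (i + 1) * width) for i in range(n_bands)]


# Example with assumed values: telephone-like range, fundamentals up to 400 Hz.
bands = band_edges(300.0, 3400.0, 400.0)
```

With these assumed values the range is split into three
bands of roughly 1033 Hz each, so every band can hold at
least two consecutive harmonics of any fundamental up to
400 Hz.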

The subsignals are passed to their respective envelope
detectors 104, which form the time envelopes 105 of the
subsignals 103, e.g. by means of rectification, squaring or
analytic signals, optionally followed by low-pass
filtering. This signal processing, which following bandpass
filtering of the input signal generates and utilizes the
envelopes of the bandpass filtered subsignals, is known in
other connections from the acoustic/audiological field,
see e.g. (6).
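
One of the envelope detection options named above,
rectification followed by low-pass filtering, might look as
follows in Python. This is a sketch under assumptions: the
one-pole smoothing filter, the 600 Hz cutoff (chosen so that
modulation at the fundamental frequency survives) and the
function name envelope are illustrative choices, not details
given in the specification.

```python
import math


def envelope(samples, sample_rate, cutoff_hz=600.0):
    """Full-wave rectify the subsignal, then smooth with a one-pole low-pass."""
    # Smoothing coefficient of the one-pole filter for the chosen cutoff.
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    env = []
    state = 0.0
    for x in samples:
        # Rectification (abs) followed by recursive smoothing.
        state += alpha * (abs(x) - state)
        env.append(state)
    return env


# Example: a steady 2 kHz tone sampled at 16 kHz has a roughly constant envelope.
tone = [math.sin(2.0 * math.pi * 2000.0 * n / 16000.0) for n in range(1600)]
env = envelope(tone, 16000.0)
steady_mean = sum(env[800:]) / 800
```

Squaring instead of rectification would only change abs(x)
to x * x; the smoothing stage stays the same.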

The envelope signals are passed to a synchronism detector
106, which produces a measure of synchronism between the
envelope signals 105 for a time segment of the signals.
The time course of the computed synchronism thus has the
shape of a staircase curve and is called the synchronism
signal 107.

The principle of the synchronism detector 106 may e.g. be
based on correlation, an artificial neural network or
another computing method applied to all or a subset of the
envelope signals 105. For example, a correlation can be
computed by first computing the product sum of the signal
values for each pair of signals, i.e. the envelope signals
from two adjacent bandpass filters, and then summing all
the computed product sums.
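
The correlation example can be written out directly. The
Python sketch below follows the wording of the text (product
sums of adjacent envelope pairs, then a sum over all pairs)
for one time segment; the function name synchronism and the
list-of-lists data layout are assumptions. Any mean removal
or normalisation is left out because the text does not
mention it.

```python
def synchronism(envelopes):
    """Synchronism measure for one time segment.

    envelopes -- list of equal-length sequences, one envelope per band,
                 ordered by increasing center frequency.
    """
    total = 0.0
    # Pair each envelope with the one from the adjacent (next higher) band.
    for lower, upper in zip(envelopes, envelopes[1:]):
        # Product sum of the two envelopes over the segment.
        total += sum(a * b for a, b in zip(lower, upper))
    return total


# Toy envelopes: a and b pulse together, c pulses in between.
a = [1.0, 0.0, 1.0, 0.0]
b = [1.0, 0.0, 1.0, 0.0]
c = [0.0, 1.0, 0.0, 1.0]

in_phase = synchronism([a, b])
out_of_phase = synchronism([a, c])
three_bands = synchronism([a, b, c])
```

Evaluating the function once per time segment gives one
value per segment, which is what produces the staircase
shape of the synchronism signal.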

Finally, the synchronism signal 107 is passed to a
threshold detector 108, where the synchronism signal 107 is
compared with a threshold value. If the synchronism signal
107 is greater than the threshold value, the time segment
in question is classified as being dominated by speech,
and the classification signal 109 is set to the value
binary 1. If not, the classification signal 109 is set
to the value binary 0.
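
The comparison with a fixed threshold value amounts to one
line of Python per segment. In this sketch the function name
classify and the example values are hypothetical; only the
rule itself (strictly greater gives 1, otherwise 0) comes
from the text.

```python
def classify(synchronism_signal, threshold):
    """Return 1 for segments dominated by speech, 0 otherwise."""
    return [1 if value > threshold else 0 for value in synchronism_signal]


# Example with made-up synchronism values, one per time segment.
labels = classify([0.2, 1.5, 1.0, 3.0], threshold=1.0)
```

A value exactly equal to the threshold is classified as 0,
since the text requires the synchronism signal to be greater
than the threshold.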

The overall function of the synchronism detector 106 and
the threshold detector 108 may also be implemented by
means of a trained, a self-organizing or another type of
artificial neural network using the envelope signals 105
as input signals and forming the desired classification
signal 109 as output signal for classification of the
mixed signal.

The presence of a noise signal affects the classification
to a greater or lesser degree depending upon the
characteristics of the noise signal. If the noise signal is
stochastic, speech-like noise, the speech detection will by
and large not be affected even with a very small
signal/noise ratio. If, on the other hand, the noise signal
is a signal with an inherent modulation like that of a
speech signal, or if it is a real speech signal from one or
more persons, the interplay between the actual signal/noise
ratio and the construction of the threshold detector 108
will be of decisive importance. When e.g. the threshold
detector 108 is arranged such that the threshold value 210
adaptively adjusts itself, with a given time constant, to a
given fraction of the size of the synchronism signal 107,
then advantageously only the dominating speech signal will
be detected. Removal of the lowest frequency components of
the synchronism signal provides the additional advantage
that a continuous noise signal consisting of harmonic
frequency components (e.g. acoustic noise from a rotating
machine) will not erroneously be classified as being a
speech signal.
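
An adaptive threshold of the kind described might be
sketched in Python as follows. The specification does not
say how the adaptation is realized, so the one-pole level
tracker, the function name adaptive_classify and the default
values of fraction and time_constant (in segments) are
assumptions. The removal of the lowest frequency components
of the synchronism signal is not covered by this sketch.

```python
import math


def adaptive_classify(sync_signal, fraction=0.5, time_constant=20.0):
    """Classify segments against a threshold that follows the signal level.

    Returns (thresholds, labels), one entry per time segment.
    """
    # Per-segment smoothing coefficient for the chosen time constant.
    alpha = 1.0 - math.exp(-1.0 / time_constant)
    level = 0.0
    thresholds = []
    labels = []
    for value in sync_signal:
        # Threshold is a fixed fraction of the slowly tracked level.
        threshold = fraction * level
        thresholds.append(threshold)
        labels.append(1 if value > threshold else 0)
        # Update the tracked level after the decision for this segment.
        level += alpha * (value - level)
    return thresholds, labels


# Made-up example: a dominant talker (10), a weaker competing one (2), dominant again.
sync = [10.0] * 100 + [2.0] * 5 + [10.0] * 5
thresholds, labels = adaptive_classify(sync)
```

In this example the threshold settles near half the level of
the dominant talker, so the weaker speech-like signal falls
below it and only the dominant speech is marked.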

Fig. 2 shows an example of how a given input signal 201 is
processed in the apparatus in fig. 1. To illustrate the
fundamental principle of the invention the input signal
201 is shown in fig. 2A as a short speech signal without
noise consisting first of a (voiced) vowel and then of an
unvoiced fricative. Fig. 2B shows the frequency limited
subsignals 203 formed in the filter bank 102. Fig. 2C
illustrates the envelope signals 205 formed by the
envelope detectors 104 from the subsignals 203 in fig. 2B.
At the vowel, the envelope signals 205 in several frequency
bands are shown to be correlated with each other and
modulated with a frequency corresponding to the fundamental
frequency. At the fricative, the envelope signals 205
show that short-term energy is present simultaneously in
several frequency bands. Fig. 2D shows the synchronism
signal 207 computed by the synchronism detector 106 as
well as the threshold value 210 with which it is compared.
Finally, fig. 2E shows the obtained classification signal
209.

An apparatus according to the invention may be implemented 
either in analog or digital hardware or in software or in 
combinations thereof. 



Patent Claims:



1. A method of classifying, in a selected time interval,
a mixed speech and noise signal (101, 201) as being
significantly or insignificantly affected by the speech
signal, where the mixed signal is divided into a plurality
of separate, frequency limited subsignals (103, 203),
characterized in that

- each subsignal (103, 203) comprises at least two
harmonic frequencies for a fundamental frequency of the
speech signal,

- the time envelope (105, 205) is generated for the
subsignals (103, 203),

- a measure (107, 207) of synchronism between these
envelopes (105, 205) is generated, and

- this measure (107, 207) is compared with a threshold
value (210).

2. A method according to claim 1, characterized in that
the mixed signal is divided into a plurality of time
intervals in which the signal is classified successively.

3. A method according to claim 1, characterized in that
the selected time interval is a running time window.

4. A method according to claims 1-3, characterized in
that all envelopes are used for generating the measure
(107, 207) of synchronism between the envelopes (105, 205).

5. A method according to claims 1-3, characterized in
that one or more subsets of the envelopes (105, 205) are
used for generating the measure (107, 207) of synchronism
between the envelopes (105, 205).

6. A method according to claims 1-5, characterized in
that the generation of the measure (107, 207) of
synchronism between the envelopes (105, 205) is based on a
correlation computation.

7. A method according to claims 1-5, characterized in
that the envelopes (105, 205) are passed as input signals
to an artificial neural network which classifies the
signal.

8. An apparatus for classification of a mixed speech and
noise signal (101, 201), comprising filter means each of
which permits passage of a subsignal (103, 203),
characterized in that

- each subsignal (103, 203) contains at least two
harmonic frequencies for a fundamental frequency of the
speech signal, and that the apparatus moreover comprises

- means (104) for generating the time envelopes (105,
205) of the subsignals,

- means (106) for generating a measure (107, 207) of
synchronism between these envelopes, as well as

- means (108) for comparing the synchronism signal (107,
207) with a given threshold value (210).